Tweets Classification using Corpus Dependent Tags, Character and POS N-grams
Abstract
This paper describes our submission to the Author Profiling task of the PAN 2015 contest, in which participants had to predict the gender, age and personality traits of Twitter users in four languages (Spanish, English, Italian and Dutch). Our approach classifies tweets using stylistic features represented by character N-grams and POS N-grams. The main idea of using character N-grams is to extract as much as possible of the information encoded inside the tweet (emoticons, character flooding, use of capital letters, etc.). POS N-grams were obtained using Freeling, and certain tokens were relabeled with Twitter-dependent tags. The results were very satisfactory: our global ranking score was 83.46%.
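The two feature families described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tag names (`URL`, `MENTION`, `HASHTAG`), the regular expressions, and the helper functions are hypothetical, and the input POS tags are assumed to come from an external tagger such as Freeling, which is not called here.

```python
import re
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams, keeping case, emoticons and repeats."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Hypothetical Twitter-dependent tags; the paper's actual tag set is not given here.
TWITTER_TAGS = [
    (re.compile(r"^https?://\S+$"), "URL"),
    (re.compile(r"^@\w+$"), "MENTION"),
    (re.compile(r"^#\w+$"), "HASHTAG"),
]


def relabel(tokens, pos_tags):
    """Replace the POS tag of Twitter-specific tokens with a corpus-dependent tag."""
    relabeled = []
    for token, tag in zip(tokens, pos_tags):
        for pattern, twitter_tag in TWITTER_TAGS:
            if pattern.match(token):
                tag = twitter_tag
                break
        relabeled.append(tag)
    return relabeled


def tag_ngrams(tags, n):
    """Count overlapping n-grams over a sequence of (relabeled) POS tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


if __name__ == "__main__":
    tokens = ["@bob", "loves", "#nlp"]
    pos_tags = ["NN", "VB", "NN"]  # assumed output of an external POS tagger
    print(char_ngrams("sooo :)", 3))
    print(tag_ngrams(relabel(tokens, pos_tags), 2))
```

The resulting counts could then be fed to any standard classifier as sparse feature vectors. The choice of classifier and of N is left open here because the abstract does not state them.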
Similar resources
Efficient Social Network Multilingual Classification using Character, POS n-grams and Dynamic Normalization
In this paper we describe a dynamic normalization process applied to social network multilingual documents (Facebook and Twitter) to improve the performance of the Author profiling task for short texts. After the normalization process, n-grams of characters and n-grams of POS tags are obtained to extract all the possible stylistic information encoded in the documents (emoticons, character flood...
A POS Tagger for Code Mixed Indian Social Media Text - ICON-2016 NLP Tools Contest Entry from Surukam
Building Part-of-Speech (POS) taggers for code-mixed Indian languages is a particularly challenging problem in computational linguistics due to a dearth of accurately annotated training corpora. ICON, as part of its NLP tools contest has organized this challenge as a shared task for the second consecutive year to improve the state-of-the-art. This paper describes the POS tagger built at Surukam...
Simplified Feature Set for Arabic Named Entity Recognition
This paper introduces simplified yet effective features that can robustly identify named entities in Arabic text without the need for morphological or syntactic analysis or gazetteers. A CRF sequence labeling model is trained on features that primarily use character n-gram of leading and trailing letters in words and word n-grams. The proposed features help overcome some of the morphological an...
A'lam Corpus: A Standard Named Entity Corpus for the Persian Language
Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. A NER system is tasked with determining the border of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...
Positive, Negative, or Neutral: Learning an Expanded Opinion Lexicon from Emoticon-Annotated Tweets
We present a supervised framework for expanding an opinion lexicon for tweets. The lexicon contains part-of-speech (POS) disambiguated entries with a three-dimensional probability distribution for positive, negative, and neutral polarities. To obtain this distribution using machine learning, we propose word-level attributes based on POS tags and information calculated from streams of emoticonan...